The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
Authors
Abstract
This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. A qualitative evaluation of ukWaC vs. the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about the format and availability of the corpora and tools.
Similar works
Scalable Construction of High-Quality Web Corpora
In this article, we give an overview of the necessary steps to construct high-quality corpora from web texts. We first focus on web crawling and the pros and cons of the existing crawling strategies. Then, we describe how the crawled data can be linguistically pre-processed in a parallelized way that allows the processing of web-scale input data. As we are working with web data, controlling ...
Accurate and efficient general-purpose boilerplate detection for crawled web corpora
Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results f...
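The boilerplate removal described above can be illustrated with a minimal density-based heuristic. This is only a sketch of the general idea (blocks that are very short or dominated by link-anchor text are discarded), not the specific algorithm evaluated in the cited paper; the thresholds below are illustrative assumptions.

```python
# Minimal sketch of a density-based boilerplate heuristic (an illustration,
# not the algorithm from the cited paper). A text block is treated as
# boilerplate if it is very short or mostly made of link-anchor words.

def is_boilerplate(text, link_word_count, min_words=10, max_link_density=0.5):
    """Classify one text block; link_word_count = words inside <a> tags."""
    words = text.split()
    if len(words) < min_words:          # navigation menus are typically short
        return True
    return link_word_count / len(words) > max_link_density

# Hypothetical page blocks: (block text, number of words inside links)
blocks = [
    ("Home | About | Contact", 3),
    ("Removal of boilerplate is one of the essential tasks in web corpus "
     "construction, since menus and copyright notices distort word counts.", 0),
]
kept = [text for text, links in blocks if not is_boilerplate(text, links)]
```

Real systems use richer features (stop-word density, HTML block structure), but the filtering step has this general shape.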
Building Large Corpora from the Web Using a New Efficient Tool Chain
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been actively researched. Prominently, the WaCky initiative has provided both theoretical results and a set of web corpora for selected European languages. We present a software toolkit for web corpus construction and a set of significantly larger corpora (up to over 9 billion tokens) built using th...
Large Linguistically-Processed Web Corpora for Multiple Languages
The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmati...
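The near-duplicate removal mentioned above is commonly done by comparing documents as sets of word n-grams ("shingles"). The following is a hedged sketch of that idea using 5-word shingles and Jaccard overlap; the shingle size and threshold are illustrative choices, not the parameters of the WaCky pipeline.

```python
# Sketch of near-duplicate detection via word 5-gram shingling and Jaccard
# overlap (illustrative parameters, not the WaCky pipeline's exact method).

def shingles(text, n=5):
    """Set of overlapping word n-grams for one document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Overlap of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_near_duplicate(doc1, doc2, threshold=0.5):
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold

# Hypothetical documents: two near-identical pages and one unrelated page.
a = "the web contains vast amounts of linguistic data for research"
b = "the web contains vast amounts of linguistic data for linguists"
c = "navigation menu copyright notice all rights reserved"
```

At corpus scale, pairwise comparison is too expensive, so production systems typically hash the shingles (e.g. MinHash) to approximate the same Jaccard measure.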
On Bias-free Crawling and Representative Web Corpora
In this paper, I present a specialized open-source crawler that can be used to obtain bias-reduced samples from the web. First, I briefly discuss the relevance of bias-reduced web corpus sampling for corpus linguistics. Then, I summarize theoretical results that show how commonly used crawling methods obtain highly biased samples from the web. The theoretical part of the paper is followed by a d...
Journal: Language Resources and Evaluation
Volume 43, Issue -
Pages: -
Published: 2009